Goto

Collaborating Authors

 proposition 5


MisoDICE: Multi-Agent Imitation from Unlabeled Mixed-Quality Demonstrations

Neural Information Processing Systems

We study offline imitation learning (IL) in cooperative multi-agent settings, where demonstrations have unlabeled mixed quality -- containing both expert and suboptimal trajectories. Our proposed solution is structured in two stages: trajectory labeling and multi-agent imitation learning, designed jointly to enable effective learning from heterogeneous, unlabeled data. In the first stage, we combine advances in large language models and preference-based reinforcement learning to construct a progressive labeling pipeline that distinguishes expert-quality trajectories. In the second stage, we introduce MisoDICE, a novel multi-agent IL algorithm that leverages these labels to learn robust policies while addressing the computational complexity of large joint state-action spaces. By extending the popular single-agent DICE framework to multi-agent settings with a new value decomposition and mixing architecture, our method yields a convex policy optimization objective and ensures consistency between global and local policies. We evaluate MisoDICE on multiple standard multi-agent RL benchmarks and demonstrate superior performance, especially when expert data is scarce.


Regularized least squares learning with heavy-tailed noise is minimax optimal

Neural Information Processing Systems

This paper examines the performance of ridge regression in reproducing kernel Hilbert spaces in the presence of noise that exhibits a finite number of higher moments. We establish excess risk bounds consisting of subgaussian and polynomial terms based on the well known integral operator framework. The dominant subgaussian component allows to achieve convergence rates that have previously only been derived under subexponential noise--a prevalent assumption in related work from the last two decades. These rates are optimal under standard eigenvalue decay conditions, demonstrating the asymptotic robustness of regularized least squares against heavy-tailed noise. Our derivations are based on a Fuk-Nagaev inequality for Hilbert-space valued random variables.


Revisiting Follow-the-Perturbed-Leader with Unbounded Perturbations in Bandit Problems

Neural Information Processing Systems

Follow-the-Regularized-Leader (FTRL) policies have achieved Best-of-BothWorlds (BOBW) results in various settings through hybrid regularizers, whereas analogous results for Follow-the-Perturbed-Leader (FTPL) remain limited due to inherent analytical challenges. To advance the analytical foundations of FTPL, we revisit classical FTRL-FTPL duality for unbounded perturbations and establish BOBW results for FTPL under a broad family of asymmetric unbounded Fréchettype perturbations, including hybrid perturbations combining Gumbel-type and Fréchet-type tails. These results not only extend the BOBW results of FTPL but also offer new insights into designing alternative FTPL policies competitive with hybrid regularization approaches. Motivated by earlier observations in two-armed bandits, we further investigate the connection between the 1/2-Tsallis entropy and a Fréchet-type perturbation. Our numerical observations suggest that it corresponds to a symmetric Fréchet-type perturbation, and based on this, we establish the first BOBW guarantee for symmetric unbounded perturbations in the two-armed setting. In contrast, in general multi-armed bandits, we find an instance in which symmetric Fréchet-type perturbations violate the key condition for standard BOBW analysis, which is a problem not observed with asymmetric or nonnegative Fréchet-type perturbations. Although this example does not rule out alternative analyses achieving BOBW results, it suggests the limitations of directly applying the relationship observed in two-armed cases to the general case and thus emphasizes the need for further investigation to fully understand the behavior of FTPL in broader settings.


An Efficient Orlicz-Sobolev Approach for Transporting Unbalanced Measures on a Graph

Neural Information Processing Systems

We investigate optimal transport (OT) for measures on graph metric spaces with different total masses. To mitigate the limitations of traditional Lp geometry, Orlicz-Wasserstein (OW) and generalized Sobolev transport (GST) employ Orlicz geometric structure, leveraging convex functions to capture nuanced geometric relationships and remarkably contribute to advance certain machine learning approaches. However, both OW and GST are restricted to measures with equal total mass, limiting their applicability to real-world scenarios where mass variation is common, and input measures may have noisy supports, or outliers. To address unbalanced measures, OW can either incorporate mass constraints or marginal discrepancy penalization, but this leads to a more complex two-level optimization problem. Additionally, GST provides a scalable yet rigid framework, which poses significant challenges to extend GST to accommodate nonnegative measures.


Triangular-Reference Schrödinger Bridges for Time Series Generation

arXiv.org Machine Learning

We introduce Triangular-Reference Schrödinger Bridges for Time Series (TR-SBTS), a conservative extension of the SBTS framework in which the Brownian reference is replaced by an intervalwise frozen, possibly degenerate diffusion reference, triangular across a hierarchy of latent volatility levels. The construction is a single entropy projection on the augmented state space, with the variational constraint imposed jointly across time and the latent levels and unfolded hierarchically by the disintegration of relative entropy. The variational core of SBTS is preserved: the entropy minimiser is the h-transform of the reference, and on each frozen interval the optimal dynamics admit a logarithmic-gradient drift formula on the affine leaves of the active covariance directions, valid even when the frozen covariance is rank-deficient. We establish stability of the frozen approximation and convergence of the corresponding regularised kernel estimators. The construction is realised through a finite-dimensional conditioning map assembled from three complementary reductions of the past -- a block PCR summary, a reference-aware Mahalanobis kernel on past increments induced by the runtime frozen covariance cumulants, and a past-window WLS drift regressor under the same reference metric -- together with a coupled state-covariance bridge step in which each latent level produces a dynamic reference for the level above, summarised by a covariance descriptor; the construction is evaluated on numerical experiments.


Randomized Subspace Nesterov Accelerated Gradient

arXiv.org Machine Learning

Randomized-subspace methods reduce the cost of first-order optimization by using only low-dimensional projected-gradient information, a feature that is attractive in forward-mode automatic differentiation and communication-limited settings. While Nesterov acceleration is well understood for full-gradient and coordinate-based methods, obtaining accelerated methods for general subspace sketches that use only projected-gradient information and can improve over full-dimensional Nesterov acceleration in oracle complexity is technically nontrivial. We develop randomized-subspace Nesterov accelerated gradient methods for smooth convex and smooth strongly convex optimization under matrix smoothness and generic sketch moment assumptions. The key technical ingredient is a three-sequence formulation tailored to matrix smoothness, which recovers the corresponding classical Nesterov methods in the full-dimensional case. The resulting theory establishes accelerated oracle-complexity guarantees and makes explicit how matrix smoothness and the sketch distribution enter the complexity. It also provides a unified basis for comparing sketch families and identifying when randomized-subspace acceleration improves over full-dimensional Nesterov acceleration in oracle complexity.


Transformer Approximations from ReLUs

arXiv.org Machine Learning

We present a systematic recipe for translating ReLU approximation results to softmax Transformers1. Given a constructive ReLU approximator for a target, we construct an explicit softmax transformer with the same accuracy. The recipe applies to many common approximation targets and yields quantitative resource bounds beyond universal approximation statements. This matters because broad Universal Approximation Properties (UAP) still dominate Transformer approximation theory. For softmax Transformer, many universality results provide explicit constructions and quantitative resource bounds (e.g., parameters, depth, width...etc) [Yun et al., 2020, Kajitsuka and Sato, 2023, Takakura and Suzuki, 2023, Jiang and Li, 2024, Hu et al., 2025,


When Does Dynamic Preconditioning Preserve the Polyak-Ruppert CLT? A Stabilization Threshold

arXiv.org Machine Learning

The central limit theorem (CLT) is a foundation of statistical inference: it provides the asymptotic distribution needed for confidence intervals, hypothesis tests, and efficiency comparisons [24, 42]. For iterate-averaged stochastic gradient methods, it specifies both a Gaussian limit and its sandwich covariance in a single theorem statement. This foundation now underpins inference in streaming and online settings--online A/B testing, continual monitoring of treatment effects, and streaming M-estimation, for example--where the estimator is updated one observation at a time and inference must be performed in real time. A line of recent work develops online inference procedures for averaged SGD [10, 23, 46]. In practice, one-pass stochastic optimization is routinely combined with adaptive preconditioning, which improves computational efficiency and is believed to sharpen the resulting Gaussian approximation in finite samples. If the CLT fails or the asymptotic variance is altered by the adaptive preconditioning, all downstream inference-- coverage of confidence intervals, size of hypothesis tests, consistency of plug-in covariance estimators--is compromised. A rigorous understanding of when adaptive preconditioning preserves the CLT is, therefore, a prerequisite for reliable inference in these settings.


Parameter Tuning

Neural Information Processing Systems

If observations from the joint distribution of (A,Y,Z,W) are available in both stages, we can tune the regularization parameters λ1,λ2 using the approach proposed in Singh et al. [30], Xu et al. [35]. Let the complete data of stage 1 and stage 2 be denoted as (ai,yi,zi,wi) and ( ai, yi, zi, wi). Then, we can use the data not used in each stage to evaluate the out-of-sample performance of the other stage. A(2), ˆV(T),u(T) are the learned parameters by Algorithm 1. In this appendix, we prove propositions given in the main text. In the following, we assume that the spaces U, A, Z,W are separable and completely metrizable topological spaces and equipped with Borel σ-algebras. In this section, we use the notation PA|Z=z to express the distribution of a random variable Agiven another variable Z = z.